ntk theory
On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory
This paper aims to discuss the impact of random initialization of neural networks in the neural tangent kernel (NTK) theory, which is ignored by most recent works in the NTK theory. It is well known that as the network's width tends to infinity, the neural network with random initialization converges to a Gaussian process (f^{\mathrm{GP}}), which takes values in (L^{2}(\mathcal{X})), where (\mathcal{X}) is the domain of the data. In contrast, to adopt the traditional theory of kernel regression, most recent works introduced a special mirrored architecture and a mirrored (random) initialization to ensure the network's output is identically zero at initialization. Therefore, it remains a question whether the conventional setting and mirrored initialization would make wide neural networks exhibit different generalization capabilities.
Reviewer
We thank the reviewers for their kind and thoughtful comments on our work. Below, we respond to reviewer-specific comments. Our claim that a standard multilayer perceptron "fails to learn high frequencies in theory" is based on the theoretical For example, in the abstract, modifying "a standard MLP fails to learn high frequencies both We directly extend the 1D experiment in Figure 4 to a two-dimensional setting in Section 1.4 of the supplement, and
On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory
This paper aims to discuss the impact of random initialization of neural networks in the neural tangent kernel (NTK) theory, which is ignored by most recent works in the NTK theory. It is well known that as the network's width tends to infinity, the neural network with random initialization converges to a Gaussian process (f {\mathrm{GP}}), which takes values in (L {2}(\mathcal{X})), where (\mathcal{X}) is the domain of the data. In contrast, to adopt the traditional theory of kernel regression, most recent works introduced a special mirrored architecture and a mirrored (random) initialization to ensure the network's output is identically zero at initialization. Therefore, it remains a question whether the conventional setting and mirrored initialization would make wide neural networks exhibit different generalization capabilities. Consequently, the generalization error of the wide neural network trained by gradient descent is (\Omega(n {-\frac{3}{d 3}})), and it still suffers from the curse of dimensionality. Thus, the NTK theory may not explain the superior performance of neural networks.
Divergence of Empirical Neural Tangent Kernel in Classification Problems
Yu, Zixiong, Tian, Songtao, Chen, Guhan
This paper demonstrates that in classification problems, fully connected neural networks (FCNs) and residual neural networks (ResNets) cannot be approximated by kernel logistic regression based on the Neural Tangent Kernel (NTK) under overtraining (i.e., when training time approaches infinity). Specifically, when using the cross-entropy loss, regardless of how large the network width is (as long as it is finite), the empirical NTK diverges from the NTK on the training samples as training time increases. To establish this result, we first demonstrate the strictly positive definiteness of the NTKs for multi-layer FCNs and ResNets. Then, we prove that during training, % with the cross-entropy loss, the neural network parameters diverge if the smallest eigenvalue of the empirical NTK matrix (Gram matrix) with respect to training samples is bounded below by a positive constant. This behavior contrasts sharply with the lazy training regime commonly observed in regression problems. Consequently, using a proof by contradiction, we show that the empirical NTK does not uniformly converge to the NTK across all times on the training samples as the network width increases. We validate our theoretical results through experiments on both synthetic data and the MNIST classification task. This finding implies that NTK theory is not applicable in this context, with significant theoretical implications for understanding neural networks in classification problems.
Is the neural tangent kernel of PINNs deep learning general partial differential equations always convergent ?
In this paper, we study the neural tangent kernel (NTK) for general partial differential equations (PDEs) based on physics-informed neural networks (PINNs). As we all know, the training of an artificial neural network can be converted to the evolution of NTK. We analyze the initialization of NTK and the convergence conditions of NTK during training for general PDEs. The theoretical results show that the homogeneity of differential operators plays a crucial role for the convergence of NTK. Moreover, based on the PINNs, we validate the convergence conditions of NTK using the initial value problems of the sine-Gordon equation and the initial-boundary value problem of the KdV equation.
On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory
Chen, Guhan, Li, Yicheng, Lin, Qian
This paper aims to discuss the impact of random initialization of neural networks in the neural tangent kernel (NTK) theory, which is ignored by most recent works in the NTK theory. It is well known that as the network's width tends to infinity, the neural network with random initialization converges to a Gaussian process $f^{\mathrm{GP}}$, which takes values in $L^{2}(\mathcal{X})$, where $\mathcal{X}$ is the domain of the data. In contrast, to adopt the traditional theory of kernel regression, most recent works introduced a special mirrored architecture and a mirrored (random) initialization to ensure the network's output is identically zero at initialization. Therefore, it remains a question whether the conventional setting and mirrored initialization would make wide neural networks exhibit different generalization capabilities. In this paper, we first show that the training dynamics of the gradient flow of neural networks with random initialization converge uniformly to that of the corresponding NTK regression with random initialization $f^{\mathrm{GP}}$. We then show that $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 1$ for any $s < \frac{3}{d+1}$ and $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 0$ for any $s \geq \frac{3}{d+1}$, where $[\mathcal{H}^{\mathrm{NT}}]^{s}$ is the real interpolation space of the RKHS $\mathcal{H}^{\mathrm{NT}}$ associated with the NTK. Consequently, the generalization error of the wide neural network trained by gradient descent is $\Omega(n^{-\frac{3}{d+3}})$, and it still suffers from the curse of dimensionality. On one hand, the result highlights the benefits of mirror initialization. On the other hand, it implies that NTK theory may not fully explain the superior performance of neural networks.
Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?
Seleznova, Mariia, Kutyniok, Gitta
Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent. But do the results for infinitely-wide networks give us hints about the behaviour of real finite-width ones? In this paper we study empirically when NTK theory is valid in practice for fully-connected ReLu and sigmoid networks. We find out that whether a network is in the NTK regime depends on the hyperparameters of random initialization and network's depth. In particular, NTK theory does not explain behaviour of sufficiently deep networks initialized so that their gradients explode: the kernel is random at initialization and changes significantly during training, contrary to NTK theory. On the other hand, in case of vanishing gradients DNNs are in the NTK regime but become untrainable rapidly with depth. We also describe a framework to study generalization properties of DNNs by means of NTK theory and discuss its limits.